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Gaussian Processes for Data-Efficient Learning 

in Robotics and Control 

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen 


Abstract —Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven 
learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement 
learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real 
systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning 
approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or 
specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting 
more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. 
By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model 
errors, a key problem in model-based learning. Compared to state-of-the art RL our model-based policy search method achieves an 
unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks. 

Index Terms —Policy search, robotics, control, Gaussian processes, Bayesian inference, reinforcement learning 
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1 Introduction 

O NE of the main limitations of many current rein¬ 
forcement learning (RL) algorithms is that learn¬ 
ing is prohibitively slow, i.e., the required number of 
interactions with the environment is impractically high. 
For example, many RL approaches in problems with 
low-dimensional state spaces and fairly benign dynamics 
require thousands of trials to learn. This data inefficiency 
makes learning in real control/robotic systems imprac¬ 
tical and prohibits RL approaches in more challenging 
scenarios. 

Increasing the data efficiency in RL requires either 
task-specific prior knowledge or extraction of more in¬ 
formation from available data. In this article, we assume 
that expert knowledge (e.g., in terms of expert demon¬ 
strations [48], realistic simulators, or explicit differential 
equations for the d 5 mamics) is unavaiable. Instead, we 
carefully model the observed d 5 mamics using a general 
flexible nonparametric approach. 

Generally, model-based methods, i.e., methods which 
learn an explicit d 3 rnamics model of the environment, are 
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more promising to efficiently extract valuable informa¬ 
tion from available data [5] than model-free methods, 
such as Q-learning [55] or TD-1 earning [52]. The main 
reason why model-based methods are not widely used in 
RL is that they can suffer severely from model errors, i.e., 
they inherently assume that the learned model resembles 
the real environment sufficiently accurately [5], [48], [49]. 
Model errors are especially an issue when only a few 
samples and no informative prior knowledge about the 
task are available. Fig. 1 illustrates how model errors 
can affect learning. Given a small data set of observed 
transitions (left), multiple transition functions plausibly 
could have generated them (center). Choosing a single 
deterministic model has severe consequences: Long-term 
predictions often leave the range of the training data in 
which case the predictions become essentially arbitrary. 
However, the deterministic model claims them with full 
confidence! By contrast, a probabilistic model places a 
posterior distribution on plausible transition functions 
(right) and expresses the level of uncertainty about the 
model itself. 

When learning models, considerable model uncer¬ 
tainty is present, especially early on in learning. Thus, 
we require probabilistic models to express this uncer¬ 
tainty. Moreover, model uncertainty needs to be in¬ 
corporated into planning and policy evaluation. Based 
on these ideas, we propose PILCO (Probabilistic Infer¬ 
ence for Learning Control), a model-based policy search 
method [15], [16]. As a probabilistic model we use non¬ 
parametric Gaussian processes (GPs) [47]. PiLCO uses 
computationally efficient deterministic approximate in¬ 
ference for long-term predictions and policy evaluation. 
Policy improvement is based on analytic policy gradi¬ 
ents. Due to probabilistic modeling and inference PILCO 
achieves unprecedented learning efficiency in continu- 
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Fig. 1. Effect of model errors. Left: Small data set of observed transitions from an idealized one-dimensional 
representations of states and actions {xt, ut) to the next state xt+i = f{xt,ut). Center: Multiple plausible deterministic 
models. Right: Probabilistic model. The probabilistic model describes the uncertainty about the latent function by a 
probability distribution on the set of all plausible transition functions. Predictions with deterministic models are claimed 
with full confidence, while the probabilistic model expresses its predictive uncertainty by a probability distribution. 


ous state-action domains and, hence, is directly applica¬ 
ble to complex mechanical systems, such as robots. 

In this article, we provide a detailed overview of 
the key ingredients of fhe PILCO learning framework. 
In parficular, we assess fhe qualify of fwo differenf 
approximafe inference mefhods in fhe confexf of policy 
search. Moreover, we give a concrefe example of fhe 
imporfance of Bayesian modeling and inference for fasf 
learning from scrafch. We demonsfrafe fhat PiLCO's un¬ 
precedented learning speed makes it directly applicable 
to realistic control and robotic hardware platforms. 

This arficle is organized as follows: Affer discussing 
relafed work in Sec. 2, we describe fhe key ideas of 
fhe PILCO learning framework in Sec. 3, i.e., fhe d 3 mam- 
ics model, policy evaluafion, and gradienf-based policy 
improvement. In Sec. 4, we detail two approaches for 
long-ferm predicfions for policy evaluafion. In Sec. 5, we 
describe how fhe policy is represenfed and practically 
implemenfed. A parficular cosf function and ifs natural 
exploration/exploitation trade-off are discussed in Sec. 6. 
Experimenfal resulfs are provided in Sec. 7. In Sec. 8, we 
discuss key properties, limifafions, and exfensions of fhe 
PILCO framework before concluding in Sec. 9. 

2 Related Work 

Confrolling sysfems under paramefer uncerfainfy has 
been invesfigafed for decades in robusf and adaptive 
confrol [4], [35]. Typically, a certainly equivalence prin¬ 
ciple is applied, which freafs esfimafes of fhe model pa- 
ramefers as if fhey were fhe frue values [58]. Approaches 
fo designing adaptive controllers that explicitly take 
uncertainty about the model parameters into account are 
stochastic adaptive control [4] and dual control [23]. Dual 
control aims to reduce parameter uncertainty by explicit 
probing, which is closely related to the exploration prob¬ 
lem in RL. Robust, adaptive, and dual control are most 
often applied to linear systems [58]; nonlinear extensions 
exist in special cases [22]. 

The specification of paramefric models for a parficular 
confrol problem is offen challenging and requires intri¬ 
cafe knowledge abouf fhe sysfem. Somefimes, a rough 
model esfimafe wifh uncerfain paramefers is sufficienf fo 
solve challenging confrol problems. For insfance, in [3], 


fhis approach was applied fogefher with locally optimal 
controllers and temporal bias terms for handling model 
errors. The key idea was fo ground policy evaluafions 
using real-life frials, buf nof fhe approximafe model. 

All above-mentioned approaches to finding confrollers 
require more or less accurafe parametric models. These 
models are problem specific and have fo be manually 
specified, i.e., fhey are nof suifed for learning models 
for a broad range of fasks. Nonparamefric regression 
mefhods, however, are promising fo aufomafically ex- 
fracf fhe imporfanf feafures of fhe lafenf d 5 mamics from 
dafa. In [7], [49] locally weighfed Bayesian regression 
was used as a nonparamefric mefhod for learning fhese 
models. To deal wifh model uncerfainfy, in [7] model 
paramefers were sampled from fhe paramefer posfe- 
rior, which accounfs for femporal correlation. In [49], 
model uncertainty was treated as noise. The approach 
to controller learning was based on stochastic d 5 mamic 
programming in discretized spaces, where the model 
errors at each time step were assumed independent. 

PiLCO builds upon the idea of freafing model uncer¬ 
fainfy as noise [49]. However, unlike [49], PILCO is a 
policy search mefhod and does nof require sfafe space 
discrefization. Insfead closed-form Bayesian averaging 
over infinifely many plausible d 3 mamics models is pos¬ 
sible by using nonparamefric GPs. 

Nonparamefric GP d 5 mamics models in RL were pre¬ 
viously proposed in [17], [30], [46], where fhe GP fram¬ 
ing dafa were obfained from "motor babbling". Unlike 
PILCO, fhese approaches model global value functions fo 
derive policies, requiring accurafe value function mod¬ 
els. To reduce fhe effecf of model errors in fhe value func¬ 
tions, many dafa poinfs are necessary as value functions 
are offen discontinuous, rendering value-function based 
mefhods in high-dimensional sfafe spaces offen sfafis- 
fically and compufafionally impracfical. Therefore, [17], 
[19], [46], [57] propose fo learn GP value function models 
fo address fhe issue of model errors in fhe value function. 
However, fhese mefhods can usually only be applied 
fo low-dimensional RL problems. As a policy search 
mefhod, PILCO does nof require an explicif global value 
function model buf rafher searches direcfly in policy 
space. However, unlike value-function based mefhods. 
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Algorithm 1 PILCO 

1: init: Sample controller parameters 0 ^ JV{0,1). 

Apply random control signals and record data. 

2 : repeat 

3: Learn probabilistic (GP) dynamics model, see 

Sec. 3.1, using all data 
4: repeat 

5: Approximate inference for policy evaluafion, see 

Sec. 3.2: gef .^{6), Eq. (9)-(ll) 

6: Gradienf-based policy improvemenf, see 

Sec. 3.3: gef d.r{e)/de, Eq. (12)-(16) 

7: Updafe paramefers 6 (e.g., CG or L-BEGS). 

8: until convergence; return 6* 

9: Set TT* ^ 7 f(0*) 

10: Apply TT* to system and record data 

11: until task learned 


PILCO is currently limited to episodic set-ups. 


3 Model-based Policy Search 

In this article, we consider d 5 rnamical systems 

Xt+i = f{xt,Ut) + W , m ~ A/'(0, Su,), (1) 

with continuous-valued states x c and controls 
u c R^, i.i.d. Gaussian system noise w, and unknown 
transition d 5 mamics /. The policy search objective is to 
find a policy/controller tt : x ^ 7r(a:, 9) = u, which 
minimizes fhe expected long-term cost 

= aio ~ A/'(/ro,So), (2) 

of following TT for T sfeps, where c{xt) is fhe cosf of 
being in sfafe x af fime t. We assume fhaf tt is a funcfion 
paramefrized by 0} 

To find a policy tt*, which minimizes (2), PILCO 
builds upon fhree componenfs: 1) a probabilisfic GP 
dynamics model (Sec. 3.1), 2) deferminisfic approximafe 
inference for long-ferm predictions and policy evalu¬ 
afion (Sec. 3.2), 3) analyfic compufafion of fhe policy 
gradienfs dJ'^{6)/dO for policy improvemenf (Sec. 3.3). 
The GP model infernally represenfs fhe dynamics in (1) 
and is subsequenfly employed for long-ferm predictions 
p(a;i|7r),... ,p(a;j’|7r), given a policy tt. These predictions 
are obfained fhrough approximafe inference and used 
fo evaluafe fhe expecfed long-ferm cosf J'^{9) in (2). 
The policy tt is improved based on gradienf informa¬ 
tion dJ'^{9)/d9. Alg. 1 summarizes fhe PILCO learning 
framework. 

1. In our experiments in Sec. 7, we use a) nonlinear parametrizations 
by means of RBF networks, where the parameters 0 are the weights 
and the features, or b) linear-affine parametrizations, where the pa¬ 
rameters 9 are the weight matrix and a bias term. 


3.1 Model Learning 

PiLCO's probabilisfic dynamics model is implemenfed 
as a GP, where we use fuples {xt,ut) C as 

framing inpufs and differences A( = xt+i — Xt G R^ as 
framing fargefs.^ A GP is complefely specified by a mean 
funcfion m( •) and a posifive semidefinife covariance 
funcfion/kernel In this paper, we consider a 

prior mean funcfion m = 0 and fhe covariance funcfion 

k{xp,Xq) = ajexp (-l(&p-*,)^A~^(*p-*q))-hdpqcr^ 

(3) 

wifh X := We defined A := diag([ff,... ,f|)_|_p]) 

in (3), which depends on fhe characferisfic lengfh-scales 
£i, and a'j is fhe variance of fhe lafenf fransifion func¬ 
fion /. Given n framing inpufs X = [xi,... ,Xn] and 
corresponding framing fargefs y = [Ai,..., A„]^, fhe 
posferior GP h 5 rper-paramefers (lengfh-scales A, signal 
variance and noise variance cr^) are learned by 
evidence maximizafion [34], [47]. 

The posferior GP is a one-sfep prediction model, and 
fhe predicfed successor sfafe Xt+i is Gaussian disfribufed 

p{xt+i\xt, Ut) = M{xt+i I St+i) (4) 

Mt+i = ait-hE/[At], St+i =var/[At], (5) 

where fhe mean and variance of fhe GP predicfion are 

Ey[At] = m/(*t) = kl{K + all)-^y = kj(3, (6) 

vary[At] = - fcJ(lT-h , (7) 

respectively, wifh fc* := k{X,Xt), A:** := k{xt,xt), and 
/3 := {K -\- a‘^I)~^y, where K is fhe kernel mafrix wifh 
enfries Kij = k{xi,Xj). 

Por mulfivariafe fargefs, we frain conditionally inde- 
pendenf GPs for each fargef dimension, i.e., fhe GPs are 
independenf for given fesf inpufs. Eor uncerfain inpufs, 
fhe fargef dimensions covary [44], see also Sec. 4. 

3.2 Policy Evaluation 

To evaluafe and minimize in (2) PILCO uses long- 
ferm predictions of fhe sfafe evolution. In particular, we 
defermine fhe marginal f-sfep-ahead predictive disfri- 
bufions p{xi\tt), ... ,p{xt\tt) from fhe initial sfafe dis- 
fribufion p(xo), t = 1,... ,T. To obfain fhese long-ferm 
predictions, we cascade one-sfep predictions, see (4)-(5), 
which requires mapping uncerfain fesf inpufs fhrough 
fhe GP d 5 mamics model. In fhe following, we assume 
fhaf fhese fesf inpufs are Gaussian disfribufed. Por nofa- 
fional convenience, we omif fhe explicif condifioning on 
fhe policy tt in fhe following and assume fhaf episodes 
sfarf from £Co ~ p{xq) = M(xq \ /Xg, Eg). 

Por predicting Xt+i from p{xt), we require a joinf 
disfribufion p{xt) = p{xt,Ut), see (1). The confrol Ut — 
TT{xt,9) is a funcfion of fhe sfafe, and we approximafe 

2. Using differences as training targets encodes an implicit prior 
mean function m{x) = x. This means that when leaving the training 
data, the GP predictions do not fall back to 0 but they remain constant. 
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Fig. 2. GP prediction at an uncertain input. The input 
distribution p{xt,ut) is assumed Gaussian (iower ieft 
panei). When propagating it through the GP modei (upper 
ieft panei), we obtain the shaded distribution p(A(), upper 
right panei. We approximate p{At) by a Gaussian (upper 
right panei), which is computed by means of either mo¬ 
ment matching (biue) or iinearization of the posterior GP 
mean (red). Using iinearization for approximate inference 
can iead to predictive distributions that are too tight. 


the desired joint distribution p{xt) = p{xt,Ut) by a 
Gaussian. Details are provided in Sec. 5.5. 

From now on, we assume a joint Gaussian distribution 
distribution p{xt) = N{xt \ /ij, S*) at time t. To compute 

p{At) = jj p{f{xt)\xt)p{xt) d/d*t, (8) 

we integrate out both the random variable xt and the 
random function /, the latter one according to the pos¬ 
terior GP distribution. Gomputing the exact predictive 
distribution in (8) is analytically intractable as illustrated 
in Fig. 2. Hence, we approximate p{At) by a Gaussian. 

Assume the mean and the covariance Sa of fhe 
predictive disfribufion p{At) are known^. Then, a Gaus¬ 
sian approximafion fo fhe desired predicfive disfribufion 
p{xt+i) is given as N{xt+i \ St+i) wifh 

Mt+i = A*t + Ma j (9) 

St+i = St -I- Sa -F covjait, At] -F cov[At, ait] ■ (10) 

Nofe fhaf bofh and Sa are funcfions of fhe mean 
/i.„ and fhe covariance S„ of fhe confrol signal. 

To evaluafe fhe expecfed long-ferm cosf J'^ in (2), if 
remains fo compufe fhe expecfed values 

lEx,[c(a;t)] = j c[xt)M {xt\ pit,lit) <^xt, (H) 

t = 1,...,T, of fhe cosf c wifh respecf fo fhe predic¬ 
five sfafe disfribufions. We choose fhe cosf c such fhaf 
fhe infegral in (11) and, fhus, in (2) can compufed 
analyfically. Examples of such cosf funcfions include 
pol 5 momials and mixfures of Gaussians. 

3.3 Analytic Gradients for Policy Improvement 

To find policy paramefers 6, which minimize J'^{6) in 
(2), we use gradienf information dJ'^{6)/ A9. We require 

3. We will detail their computations in Secs. 4.1^.2. 


fhaf fhe expecfed cosf in (11) is differentiable wifh respecf 
fo fhe momenfs of fhe sfafe disfribufion. Moreover, we 
assume fhaf fhe momenfs of fhe confrol disfribufion 
and S„ can be compufed analyfically and are differen¬ 
tiable wifh respecf fo fhe policy paramefers 9. 

In fhe following, we describe how fo analyfically com¬ 
pufe fhese gradienfs for a gradienf-based policy search. 
We obfain fhe gradienf dJ'^ / d9 by repeafed applicafion 
of fhe chain-rule: Firsf, we move fhe gradienf info fhe 
sum in (2), and wifh 8t '■= Ea.Jc(a;t)] we obfain 


dJ’^(6>) 

d9 

d9 


dft 

d£t Apjxt) _ d£t d/xt 
dp(a;t) d9 dfit d9 


d£t dSt 
dSt d9 


( 12 ) 


where we used fhe shorfhand nofafion d£t/ dp{xt) = 
{d£t/ d/Xj, d£t/dSt} for faking fhe derivative of £t wifh 
respecf fo bofh fhe mean and covariance of p{xt) = 
N{xt \ pit^'^t)■ Second, as we will show in Sec. 4, fhe 
predicfed mean /x^ and covariance St depend on fhe 
momenfs of p(xt-i) and fhe confroller paramefers 9. By 
applying fhe chain-rule fo (12), we obfain fhen 


dpjxt) _ dp{xt) dp{xt-i) dpjxt) 

d9 dp{xt-i) d9 d9 

dpjxt) _ ( dfit dHt \ 

dp{xt-i) \dp{xt-i)’ dp{xt-i)) ' 


(13) 

(14) 


From here onward, we focus on d/Xt/ d9, see (12), buf 
computing dSt/ d9 in (12) is similar. For d/Xj/ d9, we 
compufe fhe derivafive 

d/Tt ^ dp.t d/xt_i d/At dSt-i ^ 

d6> d/xt_i d9 dSt-i d9 d9 ' ^ ’ 

Since dp(a;t_i) /dd in (13) is known from time sfep t—1 
and dpit/dp{xt-i) is compufed by applying fhe chain- 
rule fo (17)-(20), we conclude wifh 


d/xt _ d/x^ dp{ut-i) _ dpL^ dpL^ dpi^ dS„ 
d9 dpiut-i) 39 dpL^ 39 dS„ 39 


The parfial derivafives of /x„ and S„, i.e., fhe mean and 
covariance of p{ut), used in (16) depend on fhe policy 
represenfafion. The individual parfial derivafives in (12)- 
(16) depend on fhe approximafe inference mefhod used 
for propagafing sfafe disfribufions fhrough time. For 
example, wifh momenf mafching or linearizafion of fhe 
posferior GP (see Sec. 4 for defails) fhe desired gradienfs 
can be compufed analyfically by repeafed applicafion of 
fhe chain-rule. The Appendix derives fhe gradienfs for 
fhe momenf-mafching approximafion. 

A gradienf-based opfimizafion mefhod using estimates 
of fhe gradienf of J^{9) such as finife differences or 
more efficienf sampling-based mefhods (see [43] for an 
overview) requires many function evaluafions, which 
can be compufafionally expensive. However, since in 
our case policy evaluafion can be performed analyfically, 
we profif from analyfic expressions for fhe gradienfs, 
which allows for sfandard gradienf-based non-convex 
opfimizafion mefhods, such as GG or BFGS, fo defermine 
opfimized policy paramefers 9*. 
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4 Long-Term Predictions 

Long-term predictions p{xi),... ,p{xt) for a given pol¬ 
icy parametrization are essential for policy evaluation 
and improvement as described in Secs. 3.2 and 3.3, 
respectively. These long-term predictions are computed 
iteratively: At each time step, PILCO approximates the 
predictive state distribution p{xt+i) by a Gaussian, see 
(9)-(10). For this approximation, we need to predict with 
GPs when the input is given by a probability distribution 
p{xt), see (8). In this section, we detail the computations 
of the mean and covariance matrix Sa of the GP 
predictive distribution, see (8), as well as the cross¬ 
covariances cov[it, At] = cov[[£c 7, At], which are 
required in (9)-(10). We present two approximations to 
predicting with GPs at uncertain inputs: Moment match¬ 
ing [15], [44] and linearization of the posterior GP mean 
function [28]. While moment matching computes the first 
two moments of the predictive distribution exactly, their 
approximation by explicit linearization of the posterior 
GP is computationally advantageous. 

4.1 Moment Matching 

Following the law of iterated expectations, for target 
dimensions a = 1,..., D, we obtain the predictive mean 

Pa = IEit[fE/a[/a(*t)|*t]] = 

= J {xt)M{xt 1 fit. %) dxt = (3jqa , ( 17 ) 

l3, = {K,+alj-^y,, (18) 

with = [^ai, ■ ■ • 7 <Za„]^- The entries of G R” are 
computed using standard results from multiplying and 
integrating over Gaussians and are given by 

= J ka{x„Xt)J\f(xt\flt,'St)dXt (19) 

= +I\~^ exp ( - kvj{'Et + , 

where we define 

Vi := {xi - fit) (20) 

as the difference between the training input Xi and the 
mean of the test input distribution p{xt,uf). 

Computing the predictive covariance matrix Sa € 
requires us to distinguish between diagonal el¬ 
ements and off-diagonal elements a ^ b: Using 
the law of total (co-)variance, we obtain for target di¬ 
mensions a,b = 1,... ,D 

= IE*t[var/[Aal*t]] -h - (/x^)2 , (21) 

= %.57jAaAb]-/x^/i^ , ay^b, (22) 

respectively, where Pa known from (17). The off- 
diagonal terms cr^j, do not contain the additional term 
EsJcov/[Aa, Ahj&t]] because of the conditional indepen¬ 
dence assumption of the GP models: Different target 
dimensions do not covary for given Xf 

We start the computation of the covariance matrix with 
the terms that are common to both the diagonal and the 


off-diagonal entries: With p{xt) = Af{xt \ fit. ^t) arid the 
law of iterated expectations, we obtain 

E/,*JAaAh] = Es, [E/[Aojit] E/[Ahj&t]] 

= Jm'f{xt)m^f{xt)p{xt)dxt (23) 

because of the conditional independence of Aa and A?, 
given Xf Using the definition of the GP mean function 
in (6), we obtain 

Ey.iJA,Ab]=/3jg/3,, (24) 

Q = J ka{xt,X)^ kb{xt,X)p{xt)dxt ■ (25) 

Using standard results from Gaussian multiplications 
and integration, we obtain the entries Qij of Q G R"^” 

Qij — |.f^| kai^Xi^ fit)kbi^xj^ fit) crycp i^-2^ij^ (7b) 

where we define 

R := + A,-i) + J, T := A-i + A,"' + S,”', 

:= A^Vi -f , 

with Ui defined in (20). Hence, the off-diagonal entries of 
Sa are fully determined by (17)-(20), (22), and (24)-(26). 

From (21), we see that the diagonal entries contain the 
additional term 

Ei, [var/[Aal*t]] =a-\ - ir[{Ka + (Tl,J)~^Q) + (27) 

with Q given in (26) and being the system noise 
variance of the ath target dimension. This term is the 
expected variance of the function, see (7), under the 
distribution p{xt). 

To obtain the cross-covariances cov[a:(,Ai] in (10), we 
compute the cross-covariance cov[a;i,At] between an 
uncertain state-action pair Xt ~ Af{fit, Sj) and the cor¬ 
responding predicted state difference xt+i — xt = At ^ 
Af{fiy^, Sa). This cross-covariance is given by 

cov[xt, At] — Wfxtjl^tAt ] fitti/x.. (78) 

where the components of Pa given in (17), and fit 
is the known mean of the input distribution of the state- 
action pair at time step t. 

Using the law of iterated expectation, for each state 
dimension a = 1,..., D, we compute E^., j[it A“] as 

Ei,j[*t A“] = EiJitE/[A“lit]] =Jxtm‘f{xt)p{xt)dxt 

/ n 

Xt[^l3aik‘}{xt.x,)^p{xt)dxt, (79) 

i=i 

where the (posterior) GP mean function mf(xt) was 
represented as a finite kernel expansion. Note that x^ 
are the state-action pairs, which were used to train the 
dynamics GP model. By pulling the constant da, out of 
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the integral and changing the order of summation and 
integration, we obtain 

n « 

= Cl N{xt\xi, Aa)N{xt\ij,t,'t>t) dxt , (30) 

=fc“(xt,Xi) p(xt) 

D+F 1 

where we define ci := (27r) 2 |Aa|2 wifh x e 

]R^+^, such fhaf k'j{xt,Xi) = ciAf (xt\xi, Aa) is an 
urmormalized Gaussian probabilify disfribufion in xt, 
where Xi, i = are fhe GP framing inpufs. 

The producf of fhe fwo Gaussians in (30) yields a new 
(unnormalized) Gaussian | wifh 

, D+F . 1 

C2^ = (27r) 2 |Aa + St| 2 

X exp ( - \{x, - i+tV{Aa + St)~^(*i - Ai)). 

^ = (A-1 + = '&(A-1*, + y Vt) • 

By pulling all remaining variables, which are indepen- 
denf of Xt, ouf of fhe infegral in (30), fhe infegral 
defermines fhe expecfed value of fhe producf of fhe fwo 
Gaussians, Hence, we obfain 

\ 1 

lEi,,/[*t A“] = y].^^CiC^Va,'0i, a = 

E Th _ 

.^^CiC^ (31) 

for all predicfive dimensions a = 1,... ,E. Wifh cic^^ = 
qa„ see (19), and 1 /)^ = %{'Et+Aa)~'^Xi+A{'St+Aa)~'^ij,t 
we simplify (31) and obfain 

n 

c0V5.,j[&t, A“] = y]/3o,ga.S((i:t+Aa)"^(&,-At). (32) 

a = l,...,ii^. The desired covariance cov[a;t,At] is a 
D X E submafrix of fhe (D + F) x E cross-covariance 
compufed in fo (32). 

A visualization of fhe approximafion of fhe predicfive 
disfribufion by means of exacf momenf mafching is 
given in Fig. 2. 

4.2 Linearization of the Posterior GP Mean Function 

An alternative way of approximating the predictive dis¬ 
tribution p(At) by a Gaussian for xt ~ Af{xt 
is to linearize the posterior GP mean function. Fig. 2 
visualizes the approximation by means of linearizing the 
posterior GP mean function. 

The predicted mean is obtained by evaluating the pos¬ 
terior GP mean in (5) at the mean p,t of the input 
distribution, i.e., 

Ma =E/[/a(Mt)] =mf^{flt) = (iZka{X, fit), (33) 

a = 1 ,..., i?, where /3„ is given in (18). 

To compute the GP predictive covariance matrix 
we explicitly linearize the posterior GP mean function 
around fif By applying standard results for mapping 


Gaussian distributions through linear models, the pre¬ 
dictive covariance is given by 

Sa = VttV^ + llu,, (34) 

( 35 ) 

dfit dfit 

In (34), Suj is a diagonal matrix whose entries are 
the noise variances plus the model uncertainties 
var/[A“|/i(] evaluated at fit, see (7). This means, model 
uncertainty no longer depends on the density of the data 
points. Instead it is assumed to be constant. Note that the 
moments computed in (33)-(34) are not exact. 

The cross-covariance cov[a;i, At] is given by TitV, where 
V is defined in (35). 

5 Policy 

In the following, we describe the desired properties of 
the policy within the PILCO learning framework. First, to 
compute the long-term predictions p{xi),... ,p{xx) for 
policy evaluation, the policy must allow us to compute 
a distribution over controls p{u) = p(f(x)) for a given 
(Gaussian) state distribution p(x). Second, in a realistic 
real-world application, the amplitudes of the control 
signals are bounded. Ideally, the learning system takes 
these constraints explicitly into account. In the following, 
we detail how PILCO implements these desiderata. 

5.1 Predictive Distribution over Controis 

During the long-term predictions, the states are given by 
a probability distribution p{xt), t = 0,..., T. The prob¬ 
ability distribution of the state xt induces a predictive 
distribution p{ut) = p{Tr{xt)) over controls, even when 
the policy is deterministic. We approximate the distribu¬ 
tion over controls using moment matching, which is in 
many interesting cases analytically tractable. 

5.2 Constrained Control Signals 

In practical applications, force or torque limits are 
present and must be accounted for during plan¬ 
ning. Suppose the control limits are such that u c 
[—Mmax, Mmaxj- Let US consider a preliminary policy tt with 
an unconstrained amplitude. To account for the control 
limits coherently during simulation, we squash the pre¬ 
liminary policy TT through a bounded and differentiable 
squashing function, which limits the amplitude of the 
final policy tt. As a squashing function, we use 

cr(a;) = I sin(a;)-I-I sin(3x) €[—1,1], (36) 

which is the third-order Fourier series expansion of a 
trapezoidal wave, normalized to the interval [—1,1]- The 
squashing function in (36) is computationally convenient 
as we can analytically compute predictive moments for 
Gaussian distributed states. Subsequently, we multiply 
the squashed policy by Umax and obtain the final policy 

7r)x) = 'UijiaxIT( 77 ( 31 )) G [ U-niax? Tirnax] 5 (37) 
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(a) Preliminary policy -S- as a func- (b) Policy vr = o-(-k{x)) as a func¬ 
tion of the state. tion of the state. 

Fig. 3. Constraining the controi signai. Panei (a) shows 
an exampie of an unconstrained preiiminary poiicy it as a 
function of the state x. Panei (b) shows the constrained 
poiicy 7r(a:) = CT(7f(a:)) as a function of the state x. 

an illustration of which is shown in Fig. 3. Although the 
squashing function in (36) is periodic, it is almost always 
used within a half wave if fhe preliminary policy if is 
initialized fo produce function values fhaf do nof exceed 
fhe domain of a single period. Therefore, fhe periodicify 
does nof matter in pracfice. 

To compufe a disfribufion over consfrained confrol 
signals, we execufe fhe following sfeps: 

p{xt) p{T^{xt)) p{u^^y,a{Tr{xt))) = p{ut). (38) 

Firsf, we map fhe Gaussian sfafe disfribufion p{xt) 
fhrough fhe preliminary (unconsfrained) policy if. Thus, 
we require a preliminary policy if fhaf allows for closed- 
form compufafion of fhe momenfs of fhe disfribufion 
over confrols p{TT{xt)). Second, we squash fhe approx- 
imafe Gaussian disfribufion p{n{x)) according fo (37) 
and compufe exacfly fhe mean and variance of p{tt{x)). 
Defails are given in fhe Appendix. We approximafe 
p(-k(x)) by a Gaussian wifh these moments, yielding the 
distribution p(u) over controls in (38). 

5.3 Representations of the Preliminary Policy 

In the following, we presenf two represenfafions of 
fhe preliminary policy ff, which allow for closed-form 
compufafions of fhe mean and covariance of p{n(x)) 
when the state x is Gaussian distributed. We consider 
both a linear and a nonlinear representations of if. 

5.3.1 Linear Policy 

The linear preliminary policy is given by 

7f(a;*) = Aa;*-t 6, (39) 

where A is a paramefer mafrix of weighfs and b is an 
offsef vecfor. In each confrol dimension d, fhe policy in 
(39) is a linear combination of fhe sfafes (fhe weighfs are 
given by fhe dfh row in A) plus an offsef bd. 

The predictive disfribufion p(7f(a:*)) for a sfafe disfri¬ 
bufion a:* ~ S*) is an exacf Gaussian wifh mean 

and covariance 

+ b, c0Va;j7f(a;*)] = AS*A^ , (40) 

respectively A drawback of fhe linear policy is fhaf if 
is nof flexible. Flowever, a linear confroller can often be 
used for sfabilizafion around an equilibrium. 


5.3.2 Nonlinear Policy: Deterministic Gaussian Process 

In fhe nonlinear case, we represenf fhe preliminary 
policy n by 

N 

Tr{x^) = '^^k{mi,x^){K -|-cr^7)“^f = k{M,x ^)^ol , (41) 

i=l 

where a;* is a fesf inpuf, a = {K -t- 0.01J)~^f, where 
t plays fhe role of a GP's fraining fargefs. In (41), 
M = [mi,..., miv] are fhe cenfers of fhe (axis-aligned) 
Gaussian basis functions 

k{Xp,Xq) = exp {- ^{Xp- XqyA~^{Xp - Xq)) . (42) 

We call fhe policy represenfafion in (41) a deterministic GP 
wifh a fixed number of N basis funcfions. Here, "defer- 
minisfic" means fhaf fhere is no uncerfainfy abouf fhe 
underlying funcfion, fhaf is, var^. [7f(a;)] = 0. Therefore, 
fhe deferminisfic GP is a degenerafe model, which is 
functionally equivalent to a regularized RBF network. 
The deterministic GP is functionally equivalent to the 
posterior GP mean function in (6), where we set the 
signal variance to 1, see (42), and the noise variance to 
0.01. As the preliminary policy will be squashed through 
a in (36) whose relevant support is the interval [—f, f], 
a signal variance of 1 is abouf righf. Seffing addifionally 
fhe noise sfandard deviafion fo 0.1 corresponds fo fixing 
fhe signal-fo-noise ratio of fhe policy fo 10 and, hence, 
fhe regularization. 

For a Gaussian distributed state a;* ~ S*), the 

predictive mean of if (a;*) as defined in (41) is given as 

IEcc.[7f(a;,)] = a;*)] 

= o^J J k{M,x.,)p{x^)dx., = cxjra , (43) 

where for i = 1,... ,N and all policy dimensions a = 
1,...,F 

ra, = + in 

X exp(-i(/a* - mi)^(S* -h Aa)“^(/T* - m^)). 

The diagonal mafrix confains fhe squared lengfh- 
scales £i, i — 1,...,D- The predicfed mean in (43) is 
equivalent to the standard predicted GP mean in (17). 

For a,b = 1,..., F, the entries of fhe predictive covari¬ 
ance matrix are compufed according fo 

C0V3;j7fo(a;*),7f{,(a:,)] 

— [tt,,( ai,^)7r5(ai*)] [Ta(ai*)jEa;^ [7r^(ai,j,)], 

where E3;j7f{a_{,}(a;*)] is given in (43). Hence, we focus 
on fhe ferm E^;, [7fa(a:*)7fh(a;*)], which for a, 6 = 1,..., F 
is given by 
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Algorithm 2 Computing the Successor State Distribution 

1 : init: Xt ~ ^t) 

2: Control distribution p(Mi) = p{uynaxO'{n{xt,0))) 

3: Joint state-control distribution p{xt) =p{xt,ut) 

4: Predictive GP distribution of change in state p{^t) 
5: Distribution of successor sfafe p{xt+i) 


For i,j = 1,... ,N, we compufe fhe enfries of Q as 

Qij — J' ka (ru^, m,); J ki) , x^^pi^X:^^ dx^ 

= ka{mi,x^)kt{mj,x^)\R\ 2 Zij), 

R = S,(A-i + A-i) + 1, T = A”i + A^i + S;! , 

- ^j) ■ 

Combining fhis resulf wifh (43) fully defermines fhe 
predictive covariance mafrix of fhe preliminary policy 
Unlike fhe predicfive covariance of a probabilisfic GP, 
see (21)-(22), fhe predicfive covariance mafrix of fhe de- 
ferminisfic GP does nof comprise any model uncerfainfy 
in ifs diagonal enfries. 

5.4 Policy Parameters 

In fhe following, we describe fhe policy paramefers for 
bofh fhe linear and fhe nonlinear policy^. 

5.4.1 Linear Policy 

The linear policy in (39) possesses D + 1 paramefers per 
confrol dimension: For confrol dimension d fhere are D 
weighfs in fhe dfh row of fhe mafrix A. One addifional 
paramefer originafes from fhe offsef paramefer bd. 

5.4.2 Nonlinear Policy 

The paramefers of fhe deferminisfic GP in (41) are fhe 
locafions M of fhe cenfers {DN paramefers), fhe (shared) 
lengfh-scales of fhe Gaussian basis funcfions {D lengfh- 
scale paramefers per fargef dimension), and fhe N far- 
gefs t per fargef dimension. In fhe case of mulfivariafe 
confrols, fhe basis funcfion cenfers M are shared. 

5.5 Computing the Successor State Distribution 

Alg. 2 summarizes the computational steps required to 
compute the successor state distribution p{xt+i) from 
p{xt). The computation of a distribution over controls 
p{ut) from the state distribution p(a:() requires two steps: 
First, for a Gaussian state distribution p{xt) at time t a 
Gaussian approximation of the distribution p{TT{xt)) of 
the preliminary policy is computed analytically. Second, 
the preliminary policy is squashed through a and an 
approximate Gaussian distribution of p{u^i,^a{k{xt))) 
is computed analytically in (38) using results from 

4. For notational convenience, with a (non)linear policy we mean 
the (non)linear preliminary policy ir mapped through the squashing 
function a and subsequently multiplied by Umax- 


the Appendix. Third, we analytically compute a Gaus¬ 
sian approximation to the joint distribution p{xt,Ut) = 
p{xt, Tr{xt)). For this, we compute (a) a Gaussian approx¬ 
imation to the joint distribution p{xt,Tr{xt)), which is 
exact if tt is linear, and (b) an approximate fully joint 
Gaussian distribution p{xt,7r{xt),ut). We obtain cross¬ 
covariance information between the state xt and the 
control signal Ut = u^g^^a{Tr{xt)) via 

cov[xt, Ut] = cov[xt,n{xt)]cov[TT{xt), Tr{xt)]~'^cov[7r{xt), Ut], 

where we exploit the conditional independence of xt 
and Ut given TT{xt). Then, we integrate Ti'{xt) out to 
obtain the desired joint distribution p{xt, ut). This leads 
to an approximate Gaussian joint probability distribution 
p{xt, Ut) = p{xt, 7r(xt)) = p(xt). Fourth, with the approx¬ 
imate Gaussian input distribution p{xt), the distribution 
p{^t) of the change in state is computed using the 
results from Sec. 4. Finally, the mean and covariance of 
a Gaussian approximation of the successor state distri¬ 
bution p{xt+x) are given by (9) and (10), respectively 

All required computations can be performed analyti¬ 
cally because of the choice of the Gaussian covariance 
function for the GP d 5 mamics model, see (3), the repre¬ 
sentations of the preliminary policy tt, see Sec. 5.3, and 
the choice of the squashing function, see (36). 

6 Cost Function 

In our learning set-up, we use a cost function that 
solely penalizes the Euclidean distance d of the current 
state to the target state. Using only distance penalties is 
often sufficient to solve a task: Reaching a target retarget 
with high speed naturally leads to overshooting and, 
thus, to high long-term costs. In particular, we use the 
generalized binary saturating cost 

cix) = 1 - exp ( - 2 ^ d{x, aitarget)^) € [0,1], (44) 

which is locally quadratic but saturates at unity for large 
deviations d from the desired target aitarget- In (44), the 
geometric distance from the state x to the target state is 
denoted by d, and the parameter CTc controls the width 
of the cost function.® 

In classical control, typically a quadratic cost is as¬ 
sumed. However, a quadratic cost tends to focus atten¬ 
tion on the worst deviation from the target state along 
a predicted trajectory. In the early stages of learning the 
predictive uncertainty is large and, therefore, the policy 
gradients, which are described in Sec. 3.3 become less 
useful. Therefore, we use the saturating cost in (44) as a 
default within the PILCO learning framework. 

The immediate cost in (44) is an unnormalized Gaus¬ 
sian with mean aitarget and variance subtracted from 

5. In the context of sensorimotor control, the saturating cost function 
in (44) resembles the cost function in human reasoning as experimen¬ 
tally validated by [31]. 
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2.5 — cost function / \ 

- - peaked state distjl'i^ution 
o --wide state distribution 

^ I 1 



state 


(a) When the mean of the state 
is far away from the target, un¬ 
certain states (red, dashed-dotted) 
are preferred to more certain states 
with a more peaked distribution 
(black, dashed). This leads to initial 
exploration. 


(b) When the mean of the state is 
close to the target, peaked state 
distributions (black, dashed) cause 
less expected cost and, thus, are 
preferable to more uncertain states 
(red, dashed-dotted), leading to ex¬ 
ploitation close to the target. 


Fig. 4. Automatic expioration and expioitation with the 
saturating cost function (biue, soiid). The s-axes describe 
the state space. The target state is the origin. 


unity. Therefore, the expected immediate cost can be 
computed analytically according to 

IEx[c(ic)] = J c{x)p{x)dx (45) 

= 1 - / exp ( - i(a; - Xt^jgeiV{x - a;target))p(a:) dx , 


where is the precision matrix of the unnormalized 
Gaussian in (45). If the state x has the same representa¬ 
tion as the target vector, is a diagonal matrix with 
entries either unity or zero, scaled by Hence, for 

X ~ At(/x, S) we obtain the expected immediate cost 

Ea,[c(a:)] = l-|J + ST-i|-i/" 

X exp(-i(/x - XtaigetVSi{fl - a;target)) , (46) 
Si :=T-\I+ 'ET-^)-K (47) 


The partial derivatives ^Ea,Jc(a;t)] of 

the immediate cost with respect to the mean and the 
covariance of the state distribution p{xt) = 
which are required to compute the policy gradients 
analytically, are given by 


dEa;Jc(xt)] 

dEa;Jc(a;t)] 

dSt 


Ea;t[c(iCi)] •retarget) *^1 , (48) 

(49) 

^ ^target) ^target) ^ 


is more likely to have substantial tails in some low- 
cost region than a more peaked distribution as shown in 
Fig. 4(a). In the early stages of learning, the predictive 
state uncertainty is largely due to propagating model 
uncertainties forward. If we predict a state distribution 
in a high-cost region, the saturating cost then leads to 
automatic exploration by favoring uncertain states, i.e., 
states in regions far from the target with a poor d 5 mamics 
model. When visiting these regions during interaction 
with the physical system, subsequent model learning 
reduces the model uncertainty locally. In the subsequent 
policy evaluation, PILCO will predict a tighter state dis¬ 
tribution in the situations described in Fig. 4. 

If the mean of the state distribution is close to the 
target as in Fig. 4(b), wide distributions are likely to have 
substantial tails in high-cost regions. By contrast, the 
mass of a peaked distribution is more concentrated in 
low-cost regions. In this case, the policy prefers peaked 
distributions close to the target, leading to exploitation. 

To summarize, combining a probabilistic d 5 mamics 
model, Bayesian inference, and a saturating cost leads 
to automatic exploration as long as the predictions are 
far from the target—even for a policy, which greedily 
minimizes the expected cost. Once close to the target, the 
policy does not substantially deviate from a confident 
trajectory that leads the system close to the target.^ 


7 Experimental Results 

In this section, we assess PiLCO's key properties and 
show that PILCO scales to high-dimensional control prob¬ 
lems. Moreover, we demonstrate the hardware applica¬ 
bility of our learning framework on two real systems. 
In all cases, PILCO followed the steps outlined in Alg. 1. 
To reduce the computational burden, we used the sparse 
GP method of [50] after 300 collected data points. 

7.1 Evaluation of Key Properties 

In the following, we assess the quality of the approx¬ 
imate inference method used for long-term predictions 
in terms of computational demand and learning speed. 
Moreover, we shed some light on the quality of the Gaus¬ 
sian approximations of the predictive state distributions 
and the importance of Bayesian averaging. For these 
assessments, we applied PILCO to two nonlinear control 
tasks, which are introduced in the following. 


respectively, where .Si is given in (47). 

6.1 Exploration and Exploitation 

The saturating cost function in (44) allows for a natural 
exploration when the policy aims to minimize the ex¬ 
pected long-term cost in (2). This property is illustrated 
in Fig. 4 for a single time step where we assume a 
Gaussian state distribution p(xt). If the mean of p{xt) 
is far away from the target ajtarget/ a wide state distribution 


7.1.1 Task Descriptions 

We considered two simulated tasks (double-pendulum 
swing-up, cart-pole swing-up) to evaluate important 
properties of the PILCO policy search framework: learn¬ 
ing speed, quality of approximate inference, importance 
of Bayesian averaging, and hardware applicability. In the 
following we briefly introduce the experimental set-ups. 

6. Code is available at http://mloss.org/software/view/508/. 
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target -|- 



Fig. 5. Double pendulum with two actuators applying 
torques ui and U 2 - The cost function penalizes the dis¬ 
tance d to the target. 


7.1.1.1 Double-Pendulum Swing-Up with Two 
Actuators: The double pendulum system is a two-link 
robot arm with two actuators, see Fig. 5. The state x is 
given by the angles 9i, 6*2 and the corresponding angular 
velocities 61,62 of the inner and outer link, respectively, 
measured from being uprighf. Each link was of length 
1 m and mass 0.5 kg. Both torques ui and U 2 were 
constrained to [—3,3] Nm. The control signal could be 
changed every 100 ms. In the meantime it was constant 
(zero-order-hold control). The objective was to learn a 
controller that swings the double pendulum up from an 
initial distribution p{xq) around /Tq = [7r,7r, 0,0]^ and 
balances it in the inverted position with 9i = 0 = 02 - 
The prediction horizon was 2.5 s. 

The task is challenging since its solution requires the 
interplay of fwo correlafed confrol signals. The challenge 
is to automatically learn this interplay from experience. 
To solve the double pendulum swing-up task, a non¬ 
linear policy is required. Thus, we parametrized the 
preliminary policy as a deterministic GP, see (41), with 
100 basis functions resulting in 812 policy parameters. 
We chose the saturating immediate cost in (44), where 
the Euclidean distance between the upright position and 
the tip of the outer link was penalized. We chose the 
cost width (Tc = 0.5, which means that the tip of the 
outer pendulum had to cross horizontal to achieve an 
immediate cost smaller than unity. 

7.1.1.2 Cart-Pole Swing-Up: The cart-pole system 
consists of a cart running on a track and a freely swing¬ 
ing pendulum affached fo the cart. The state of the sys¬ 
tem is the position x of the cart, the velocity x of the cart, 
the angle 9 of the pendulum measured from hanging 
downward, and the angular velocity 9. A horizontal 
force u G [— 10,10]N could be applied fo fhe cart. The 
objective was to learn a controller to swing the pendu¬ 
lum up from around /Tq = [xq, io, 6 ^ 0 , ^ 0 ]^ = [ 0 , 0 , 0 , 0 ]^ 
and to balance it in the inverted position in the middle of 
fhe track, i.e., around xtarget = [0, *, tt, *]^. Since a linear 
controller is not capable of solving the task [45], PILCO 
learned a nonlinear state-feedback controller based on a 
deterministic GP with 50 basis functions (see Sec. 5.3.2), 
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Stale space dimensionality 


(a) Linearizing the mean function. 


(b) Moment matching. 


Fig. 6. Empirical computational demand for approximate 
Inference and derivative computation with GPs for a single 
time step, shown on a log scale, (a): Linearization of the 
posterior GP mean, (b): Exact moment matching. 


resulting in 305 policy parameters to be learned. 

In our simulation, we set the masses of fhe carf and 
fhe pendulum to 0.5 kg each, the length of the pendulum 
to 0.5 m, and the coefficient of friction between cart 
and ground to 0.1 Ns/m. The prediction horizon was 
set to 2.5 s. The control signal could be changed every 
100 ms. In the meantime, it was constant (zero-order- 
hold control). The only knowledge employed about the 
system was the length of fhe pendulum fo find appropri¬ 
ate orders of magnitude to set the sampling frequency 
(10 Hz) and fhe sfandard deviafion of the cost function 
(ctc = 0.25 m), requiring the tip of the pendulum to move 
above horizontal not to incur full cost. 

7. 1.2 Approximate Inference Assessment 
In the following, we evaluafe the quality of the pre¬ 
sented approximate inference methods for policy eval- 
uafion (moment matching as described in Sec. 4.1) and 
linearization of fhe posferior GP mean as described 
in Sec. 4.2) with respect to computational demand 
(Sec. 7.1.2.1) and learning speed (Sec. 7.1.2.2). 

7.1.2.1 Computational Demand: For a single time 
step, the computational complexity of moment matching is 
0{n‘^E‘^D), where n is fhe number of GP framing points, 
D is the input dimensionality, and E the dimension of 
fhe prediction. The mosf expensive computations are the 
entries of Q S which are given in (26). Each entry 

Qij requires evaluating a kernel, which is essentially a 
D-dimensional scalar product. The values Zij are cheap 
to compute and R needs to be computed only once. We 
end up with 0{v?E‘^D) since Q needs to be computed 
for all entries of the E x E predictive covariance matrix. 

For a single time step, the computational complexity of 
linearizing the posterior GP mean function is OlrfDE). The 
mosf expensive operafion is the determination of in 
(34), i.e., the model uncertainty at the mean of fhe input 
distribution, which scales in 0{n^D). This computation 
is performed for all E predictive dimensions, resulting 
in a compufafional complexify of OfnfDE). 

Fig. 6 illusfrates fhe empirical computational effort for 
bofh linearizafion of the posterior GP mean and exact 
moment matching. We randomly generated GP models 
in D = 1, 2, 3,4, 5, 6 , 7, 8 , 9,10,15, 20, 50 dimensions and 
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GP training set sizes of n = 100,250,500,1000 data 
points. We set the predictive dimension E = D. The 
CPU time (single core) for compufing a predicfive sfafe 
disfribufion and fhe required derivafives are shown 
as a funcfion of fhe dimensionalify of fhe sfafe. Four 
graphs are shown for sef-ups wifh 100, 250, 500, and 
1000 GP framing poinfs, respectively. Fig. 6(a) shows fhe 
graphs for approximafe inference based on linearizafion 
of fhe posferior GP mean, and Fig. 6(b) shows fhe 
corresponding graphs for exacf moment matching on a 
logarithmic scale. Computations based on linearization 
were consistently faster by a factor of 5-10. 

7.1.2.2 Learning Speed: For eighf differenf ran¬ 
dom initial frajecfories and confroller initializations, 
PILCO followed Alg. 1 fo learn policies. In fhe carf-pole 
swing-up fask, PILCO learned for 15 episodes, which 
corresponds fo a fofal of 37.5 s of dafa. In fhe double¬ 
pendulum swing-up fask, PILCO learned for 30 episodes, 
corresponding fo a fofal of 75 s of dafa. To evaluafe fhe 
learning progress, we applied fhe learned confrollers 
after each policy search (see line 10 in Alg. 1) 20 times for 
2.5 s, sfarting from 20 differenf initial states xq ^ p{xo). 
The learned controller was considered successful when 
fhe fip of fhe pendulum was close fo fhe fargef locafion 
from 2 s fo 2.5 s, i.e., af fhe end of fhe rollouf. 

• Cart-Pole Swing-Up. Fig. 7(a) shows PiLCO's aver¬ 
age learning success for fhe carf-pole swing-up fask 
as a funcfion of fhe fofal experience. We evaluafed 
bofh approximafe inference mefhods for policy eval¬ 
uation, moment matching and linearization of fhe 
posferior GP mean funcfion. Fig. 7(a) shows fhat 
learning using fhe compufafionally more demand¬ 
ing momenf mafching is more reliable fhan using 
fhe compufafionally more advanfageous lineariza¬ 
fion. On average, affer 15 s-20 s of experience, PILCO 
reliably, i.e., in « 95% of fhe fesf runs, solved fhe 
carf-pole swing-up fask, whereas fhe linearizafion 
resulfed in a success rafe of abouf 83%. 

Fig. 7(b) relafes PiLCO's learning speed (blue bar) 
fo ofher RL mefhods (black bars), which solved fhe 
carf-pole swing-up fask from scrafch, i.e., wifhouf 
human demonsfrafions or known d 5 rnamics mod¬ 
els [11], [18], [27], [45], [56]. Dynamics models 
were only learned in [18], [45], using RBF networks 
and mulfi-layered percepfrons, respecfively. In all 
cases wifhouf sfafe-space discrefizafion, cosf func¬ 
tions similar fo ours (see (44)) were used. Fig. 7(b) 
sfresses PiLCO's dafa efficiency: PiLCO outperforms 
any ofher currenfly exisfing RL algorifhm by af leasf 
one order of magnitude. 

• Double-Pendulum Swing-Up with Two Actuators. 

Fig. 8 shows the learning curves for fhe double¬ 
pendulum swing-up fask when using eifher mo¬ 
menf matching or mean function linearization for 
approximafe inference during policy evaluation. 
Fig. 8 shows fhaf PILCO learns fasfer (learning al¬ 
ready kicks in affer 20 s of dafa) and overall more 



(a) Average learning curves with 
95% standard errors: moment 
matching (MM) and posterior GP 
linearization (Lin). 



(b) Required interaction time of 
different RL algorithms for learn¬ 
ing the cart-pole swing-up from 
scratch, shown on a log scale. 


Fig. 7. Results for the cart-pole swing-up task, (a) Learn¬ 
ing curves for moment matching and linearization (sim¬ 
ulation task), (b) required interaction time for solving the 
cart-pole swing-up task compared with other algorithms. 



Fig. 8. Average success as a function of the total data 
used for learning (double pendulum swing-up). The blue 
error bars show the 95% confidence bounds of the stan¬ 
dard error for the moment matching (MM) approximation, 
the red area represents the corresponding confidence 
bounds of success when using approximate inference by 
means of linearizing the posterior GP mean (Lin). 


successfully wifh momenf mafching. Policy evalua¬ 
tion based on linearizafion of fhe posferior GP mean 
funcfion achieved abouf 80% success on average, 
whereas moment mafching on average solved fhe 
fask reliably affer abouf 50 s of dafa wifh a success 
rafe « 95%. 

Summary. We have seen fhaf bofh approximafe inference 
mefhods have pros and cons: Momenf matching requires 
more computational resources than linearization, but 
learns faster and more reliably. The reason why lineariza¬ 
tion did not reliably succeed in learning the tasks is that 
it gets relatively easily stuck in local minima, which is 
largely a result of underestimating predicfive variances, 
an example of which is given in Fig. 2. Propagating 
foo confidenf predictions over a longer horizon offen 
worsens fhe problem. Hence, in fhe following, we focus 
solely on fhe momenf mafching approximation. 

7.1.3 Quality of the Gaussian Approximation 
PiLCO strongly relies on the quality of approximafe 
inference, which is used for long-ferm predicfions and 
policy evaluafion, see Sec. 4. We already saw differences 
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Fig. 9. Long-term predictive (Gaussian) distributions dur¬ 
ing pianning (shaded) and sampie roiiouts (red), (a) In the 
eariy stages of iearning, the Gaussian approximation is a 
suboptimai choice, (b) Pilco iearned a controiier such 
that the Gaussian approximations of the predictive states 
are good. Note the different scaies in (a) and (b). 

between linearization and moment matching; however, 
both methods approximate predictive distributions by 
a Gaussian. Although we ultimately carmot answer 
whether this approximation is good under all circum¬ 
stances, we will shed some light on this issue. 

Fig. 9 shows a t 5 rpical example of the angle of fhe inner 
pendulum of fhe double pendulum sysfem where, in 
fhe early sfages of learning, fhe Gaussian approximafion 
fo fhe mulfi-sfep ahead predicfive disfribufion is nof 
ideal. The frajecfory disfribufion of a sef of rolloufs 
(red) is mulfimodal. PiLCO deals wifh fhis inappropriafe 
modeling by learning a confroller fhaf forces fhe acfual 
frajecfories info a unimodal disfribufion such fhaf a 
Gaussian approximafion is appropriafe. Fig. 9(b). 

We explain fhis behavior as follows: Assuming fhaf 
PILCO found differenf pafhs fhaf lead fo a fargef, a 
wide Gaussian disfribufion is required fo capfure fhe 
variabilify of fhe bimodal disfribufion. However, when 
compufing fhe expecfed cosf using a quadrafic or saf- 
urating cosf, for example, uncerfainfy in the predicted 
state leads to higher expected cost, assuming that the 
mean is close to the target. Therefore, PILCO uses ifs 
abilify fo choose confrol policies fo push fhe marginally 
mulfimodal frajecfory disfribufion info a single mode— 
from fhe perspective of minimizing expecfed cosf wifh 
limifed expressive power, fhis approach is desirable. 
Effectively, learning good confrollers and models goes 
hand in hand wifh good Gaussian approximafions. 

7.1.4 Importance of Bayesian Averaging 

Model-based RL greafly profifs from fhe flexibilify of 
nonparamefric models as mofivafed in Sec. 2. In fhe 
following, we have a closer look af whefher Bayesian 
models are sfricfly necessary as well. In particular, we 
evaluafed whefher Bayesian averaging is necessary for 
successfully learning from scrafch. To do so, we con¬ 
sidered fhe carf-pole swing-up fask wifh two differ¬ 
enf d 5 mamics models: firsf, fhe sfandard nonparamefric 
Bayesian GP model, second, a nonparamefric defermin- 
isfic GP model, i.e., a GP where we considered only fhe 


TABLE 1 

Average learning success with learned nonparametric 
(NP) transition models (cart-pole swing-up). 



Bayesian NP model 

Deterministic NP model 

Learning success 

94.52% 

0% 


posterior mean, but discarded the posterior model un¬ 
certainty when doing long-term predictions. We already 
described a similar kind of function representation to 
learn a deterministic policy, see Sec. 5.3.2. The difference 
to the policy is that in this section the deterministic GP is 
still nonparametric (new basis functions are added if we 
get more data), whereas the number of basis functions 
in the policy is fixed. However, the deterministic GP 
is no longer probabilistic because of the loss of model 
uncertainty, which also results in a degenerate model. 
Note that we still propagate uncertainties resulting from 
the initial state distribution p{xq) forward. 

Tab. 1 shows the average learning success of swing¬ 
ing the pendulum up and balancing it in the inverted 
position in the middle of the track. We used moment 
matching for approximate inference, see Sec. 4. Tab. 1 
shows that learning is only successful when model 
uncertainties are taken into account during long-term 
planning and control learning, which strongly suggests 
Bayesian nonparametric models in model-based RL. 

The reason why model uncertainties must be appro¬ 
priately taken into account is the following: In the early 
stages of learning, the learned d 5 rnamics model is based 
on a relatively small data set. States close to the target are 
unlikely to be observed when applying random controls. 
Therefore, the model must extrapolate from the current 
set of observed states. This requires to predict function 
values in regions with large posterior model uncertainty. 
Depending on the choice of the deterministic function 
(we chose the MAP estimate), the predictions (point 
estimates) are very different. Iteratively predicting state 
distributions ends up in predicting trajectories, which 
are essentially arbitrary and not close to the target state 
either, resulting in vanishing policy gradients. 

7.2 Scaling to Higher Dimensions: Unicycling 

We applied PILCO to learning to ride a 5-DoF unicycle 
with X G and it g in a realistic simulation of 
the one shown in Fig. 10(a). The unicycle was 0.76 m 
high and consisted of a 1 kg wheel, a 23.5 kg frame, and 
a 10 kg flywheel mounted perpendicularly to the frame. 
Two torques could be applied to the unicycle: The first 
torque jitu,| < 10 Nm was applied directly on the wheel to 
mimic a human rider using pedals. The torque produced 
longitudinal and tilt accelerations. Lateral stability of the 
wheel could be maintained by steering the wheel toward 
the falling direction of the unicycle and by applying a 
torque \ut \ < 50 Nm to the flywheel. The d 5 mamics of the 
robotic unicycle were described by 12 coupled first-order 
ODEs, see [24]. 
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(a) Robotic unicycle. 


|d e (3,10] cm Bd e (10,50] cm d>50cm 



(b) Histogram (after 1,000 test runs) of the 
distances of the flywheel from being upright. 


Fig. 10. Robotic unicycie system and simuiation resuits. 
The state space is the controi space R^. 



Fig. 12. Low-cost robotic arm by Lynxmotion [1]. The 
manipuiator does not provide any pose feedback. Flence, 
PiLCO iearns a controiier directiy in the task space using 
visuai feedback from a PrimeSense depth camera. 


The objective was to team a controller for riding the 
unicycle, i.e., to prevent it from falling. To solve fhe 
balancing fask, we used fhe linear preliminary policy 
TT(a:, 9) = Ax 4- b wifh 9 — {A, 6} s R^®. The covariance 
So of the inifial sfafe was 0.25^/ allowing each angle fo 
be off by abouf 30° (twice fhe sfandard deviafion). 

PiLCO differs from convenfional confrollers in fhaf 
if learns a single confroller for all confrol dimensions 
jointly. Thus, PILCO fakes fhe correlafion of all confrol 
and sfafe dimensions info accounf during plarming and 
confrol. Learning separafe confrollers for each confrol 
variable is offen unsuccessful [37]. 

PiLCO required abouf 20 frials, corresponding fo an 
overall experience of abouf 30 s, fo learn a d 5 mamics 
model and a confroller fhaf keeps fhe unicycle uprighf. 
A frial was aborfed when fhe fumfable hif fhe ground, 
which happened quickly during fhe five random frials 
used for inifializafion. Fig. 10(b) shows empirical resulfs 
affer 1,000 fesf runs wifh fhe learned policy: Differenfly- 
colored bars show fhe disfance of fhe fl 5 rwheel from 
a fully uprighf posifion. Depending on fhe inifial con- 
figurafion of fhe angles, fhe unicycle had a fransienf 
phase of abouf a second. Affer 1.2 s, eifher fhe unicycle 
had fallen or fhe learned confroller had managed fo 
balance if very closely fo fhe desired uprighf posifion. 
The success rafe was approximafely 93%; bringing fhe 
unicycle uprighf from exfreme inifial configurafions was 
somefimes impossible due fo fhe forque consfrainfs. 

7.3 Hardware Tasks 

In fhe following, we presenf resulfs from [15], [16], 
where we successfully applied fhe PILCO policy search 
framework fo challenging confrol and robofics fasks, 
respecfively. If is imporfanf fo menfion fhaf no fask- 
specific modificafions were necessary, besides choosing 
a confroller represenfafion and defining an immediafe 
cosf funcfion. In parficular, we used fhe same sfandard 
GP priors for learning fhe forward d 5 mamics models. 

7.3.1 Cart-Pole Swing-Up 

As described in [15], PILCO was applied fo learning fo 
confrol fhe real carf-pole sysfem, see Fig. 11, developed 


by [26]. The masses of fhe carf and pendulum were 
0.7 kg and 0.325 kg, respecfively. A horizonfal force u C 
[—10,10] N could be applied fo fhe carf. 

PiLCO successfully learned a sufficienfly good d 3 mam- 
ics model and a good confroller fully aufomafically in 
only a handful of frials and a fofal experience of 17.5 s, 
which also confirms fhe learning speed of fhe simulafed 
carf-pole sysfem in Fig. 7(b) despife fhe facf fhaf fhe 
paramefers of fhe sysfem d 5 mamics (masses, pendu¬ 
lum lengfh, fricfion, delays, sficfion, efc.) are differenf. 
Snapshofs of a 20 s fesf frajecfory are shown in Fig. 11; 
a video of fhe enfire learning process is available af 
hffp: / / www.youfube.com / user/PilcoLeamer. 

7.3.2 Controlling a Low-Cost Robotic Manipulator 
We applied PILCO fo make a low-precision robofic 
arm learn fo sfack a fower of foam blocks—fully 
aufonomously [16]. For fhis purpose, we used fhe 
lighfweighf robofic manipulator by L 5 mxmofion [1] 
shown in Fig. 12. The arm cosfs approximafely $370 and 
possesses six confrollable degrees of freedom: base ro- 
fafe, fhree joinfs, wrisf rofafe, and a gripper (open/close). 
The plastic arm was controllable by commanding both 
a desired configuration of fhe six servos via fheir pulse 
durafions and fhe durafion for execufing fhe command. 
The arm was very noisy: Tapping on fhe base made 
fhe end effecfor swing in a radius of abouf 2 cm. The 
sysfem noise was particularly pronounced when moving 
fhe arm vertically (up/down). Addifionally fhe servo 
motors had some play. 

Knowledge abouf fhe joinf configurafion of fhe robof 
was nof available. We used a PrimeSense depfh cam¬ 
era [2] as an external sensor for visual fracking fhe block 
in fhe gripper of fhe robof. The camera was identical 
fo fhe Kinecf sensor, providing a S 5 mchronized depfh 
image and a 640 x 480 RGB image af 30 Hz. Using 
sfrucfured infrared lighf, fhe camera delivered useful 
depfh informafion of objecfs in a range of abouf 0.5 m- 
5 m. The depfh resolufion was approximafely 1 cm af a 
disfance of 2 m [2]. 

Every 500 ms, fhe robof used fhe 3D center of fhe 
block in ifs gripper as fhe sfafe a: e R® fo compufe 
a confinuous-valued confrol signal u G R"*, which 
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Fig. 11. Real cart-pole system [15], Snapshots of a controlled trajectory of 20s length after having learned the task. 
To solve the swing-up plus balancing, pilco required only 17.5 s of interaction with the physical system. 


comprised the commanded pulse widths for the first 
four servo motors. Wrisf rofation and gripper opening/ 
closing were nof learned. For block fracking we used 
real-time (50 Hz) color-based region growing to esfimafe 
fhe exfenf and 3D confer of fhe objecf, which was used 
as fhe sfafe a; S IR^ by PILCO. 

As an inifial sfafe disfribufion, we chose p(xo) = 
A/’(a:o I Moj ^o) wifh fig being a single noisy measure- 
menf of fhe 3D camera coordinafes of fhe block in fhe 
gripper when fhe robof was in ifs inifial configuration. 
The inifial covariance So was diagonal, where fhe 95%- 
confidence bounds were fhe edge lengfh b of fhe block. 
Similarly, fhe fargef sfafe was sef based on a single noisy 
measuremenf using fhe PrimeSense camera. We used 
linear preliminary policies, i.e., •k(x) = u — Ax + b, and 
initialized the controller parameters 9 ~ {A, b} G to 
zero. The Euclidean distance d of fhe end effector from 
the camera was approximately 0.7m-2.0m, depending 
on the robot's configuration. The cost function in (44) 
penalized the Euclidean distance of fhe block in fhe 
gripper from ifs desired fargef location on fop of fhe 
currenf tower. Bofh fhe frequency af which fhe confrols 
were changed and fhe fime discrefizafion were sef fo 
2 Hz; fhe planning horizon T was 5 s. Affer 5 s, fhe robof 
opened fhe gripper and released fhe block. 

We splif fhe fask of building a tower info learning indi¬ 
vidual confrollers for each fargef block B2-B6 (bottom fo 
fop), see Eig. 12, sfarfing from a configuration, in which 
fhe robof arm was uprighf. All independenfly framed 
confrollers shared fhe same inifial trial. 

The motion of fhe block in fhe end effector was 
modeled by GPs. The inferred sysfem noise sfandard 
deviafions, which comprised sfochasficify of fhe robof 
arm, synchronizafion errors, delays, image processing 
errors, efc., ranged from 0.5 cm fo 2.0 cm. Here, fhe y- 
coordinafe, which corresponded fo fhe height, suffered 
from larger noise fhan fhe ofher coordinafes. The reason 
for fhis is fhaf fhe robof movemenf was particularly jerky 
in the up/down movements. These learned noise levels 
were in the right ballpark since they were slightly larger 
than the expected camera noise [2]. The signal-to-noise 
ratio in our experiments ranged from 2 fo 6. 

A fofal of fen learning-interacfing iferafions (including 
the random initial trial) generally sufficed fo learn bofh 
good forward models and good confrollers as shown 
in Pig. 13(a), which displays fhe learning curve for a 
fypical framing session, averaged over fen fesf runs affer 
each learning sfage and all blocks B2-B6. The effecfs 
of learning became noficeable affer abouf four learning 


iferations. Affer 10 learning iterations, the block in the 
gripper was expected to be very close (approximately at 
noise level) to the target. The required interaction time 
sums up to only 50 s per controller and 230 s in total (the 
initial random trial is counted only once). This speed of 
learning is difficulf fo achieve by ofher RL mefhods fhaf 
learn from scrafch as shown in Sec. 7.1.1.2. 

Pig. 13(b) gives some insighfs info fhe qualify of 
fhe learned forward model affer 10 confrolled frials. 
If shows fhe marginal predictive disfribufions and fhe 
acfual frajecfories of fhe block in fhe gripper. The robof 
learned fo pay affenfion fo sfabilizing fhe y-coordinafe 
quickly: Moving fhe arm up/down caused relafively 
large "sysfem noise" as fhe arm was quife jerky in fhis 
direction: In fhe y-coordinafe fhe predicfive marginal 
disfribufion noticeably increases befween Os and 2s. 
As soon as fhe y-coordinafe was sfabilized, fhe pre¬ 
dicfive uncerfainfy in all fhree coordinafes collapsed. 
Videos of fhe block-sfacking robof are available af 
http: / / www.youfube.com / user/PilcoLearner. 

8 Discussion 

We have shed some lighf on essenfial ingredienfs for 
successful and efficienf policy learning: (1) a probabilisfic 
forward model wifh a faifhful represenfafion of model 
uncerfainfy and (2) Bayesian inference. We focused on 
very basic represenfafions: GPs for fhe probabilisfic for¬ 
ward model and Gaussian disfribufions for fhe sfafe 
and confrol disfribufions. More expressive represenfa¬ 
fions and Bayesian inference mefhods are conceivable fo 
accounf for mulfi-modalify for insfance. However, even 
wifh our currenf sef-up, PILCO can already learn learn 
complex confrol and robotics fasks. In [8], our framework 
was used in an indusfrial applicafion for fhroffle valve 
confrol in a combusfion engine. 

Pilco is a model-based policy search mefhod, which 
uses fhe GP forward model fo predicf sfafe sequences 
given fhe currenf policy. These predictions are based 
on deferminisfic approximafe inference, e.g., momenf 
mafching. Unlike all model-free policy search mefhods, 
which are inherenfly based on sampling frajecfories [14], 
PILCO exploifs fhe learned GP model fo compufe analyfic 
gradienfs of an approximation fo fhe expecfed long- 
ferm cosf for policy search. Pinife differences or more 
efficient sampling-based approximations of fhe gradi¬ 
enfs require many function evaluations, which limifs 
fhe effective number of policy paramefers [14], [42]. 
Insfead, PILCO compufes fhe gradienfs analyfically and, 
fherefore, can learn fhousands of policy paramefers [15]. 
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(a) Average learning curve (block- 
stacking task). The horizontal axis 
shows the learning stage, the verti¬ 
cal axis the average distance to the 
target at the end of the episode. 


(b) Marginal long-term predictive distributions and actually incurred trajectories. The red lines show the 
trajectories of the block in the end effector, the two dashed blue lines represent the 95% confidence 
intervals of the corresponding multi-step ahead predictions using moment matching. The target state is 
marked by the straight lines. All coordinates are measured in cm. 


Fig. 13. Robot block stacking task: (a) Average learning curve with 95% standard error, (b) Long-term predictions. 


It is possible to exploit the learned GP model for sam¬ 
pling trajectories using the PEGASUS algorithm [39], for 
insfance. Sampling wifh GPs can be sfraighfforwardly 
parallelized, and was exploifed in [32] for learning mefa 
confrollers. However, even wifh high parallelizafion, 
policy search mefhods based on frajecfory sampling do 
usually nof rely on gradienfs [7], [30], [32], [40] and are 
pracfically limifed by a relafively small number of a few 
fens of policy paramefers fhey can manage [38].^ 

In Sec. 6.1, we discussed PiLCO's nafural explorafion 
property as a result of Bayesian averaging. If is, however, 
also possible fo explicifly encourage addifional explo¬ 
rafion in a UCB (upper confidence bounds) sense [6]: 
Insfead of summing up expecfed immediafe cosfs, see 
(2), we would add fhe sum of cosf sfandard devia- 
fions, weighfed by a factor k G R. Then, J'^{9) = 
(E[c(a;t)] -I- Ka[c{xt)]). This fype of ufilify funcfion is 
also offen used in experimenfal design [10] and Bayesian 
opfimizafion [9], [33], [41], [51] fo avoid geffing sfuck in 
local minima. Since PiLCO's approximafe sfafe disfribu- 
fions p{xt) are Gaussian, fhe cosf sfandard deviations 
a[c{xt)] can often be computed analytically. For further 
details, we refer fhe reader fo [12]. 

One of PiLCO's key benefifs is fhe reducfion of model 
errors by explicifly incorporafing model uncerfainfy info 
planning and confrol. PiLCO, however, does nof fake 
femporal correlafion info accounf. Insfead, model uncer¬ 
fainfy is freafed as noise, which can resulf in an under- 
esfimafion of model uncerfainfy [49]. On fhe ofher hand, 
fhe momenf-mafching approximafion used for approxi¬ 
mafe inference is f 5 rpically a conservafive approximafion. 

In this article, we focused on learning confrollers in 
MDPs wifh transifion d 5 mamics fhaf suffer from system 
noise, see (1). The case of measurement noise is more chal¬ 
lenging: Learning fhe GP models is a real challenge since 
we no longer have direcf access fo fhe sfafe. However, 
approaches for framing GPs wifh noise on bofh fhe fram¬ 
ing inpufs and fraining fargefs yield inifial promising 
resulfs [36]. For a more general POMDP sef-up, Gaussian 
Process D 5 mamical Models (GPDMs) [29], [54] could be 

7. "Typically, PEGASUS policy search algorithms have been using 
[...] maybe on the order of ten parameters or tens of parameters; so, 
30, 40 parameters, but not thousands of parameters [...]", A. Ng [38]. 


used for learning bofh a transifion mapping and the 
observation mapping. However, GPDMs t 5 rpically need 
a good initialization [53] since the learning problem is 
very high dimensional. 

In [25], the PILCO framework was extended to allow 
for learning reference fracking confrollers insfead of 
solely confrolling fhe system fo a fixed fargef locafion. 
In [16], we used PILCO for plarming and confrol in 
constrained environments, i.e., environmenfs wifh obsfa- 
cles. This learning sef-up is imporfanf for pracfical robof 
applicafions. By discouraging obsfacle collisions in fhe 
cosf funcfion, PILCO was able fo find pafhs around 
obsfacles wifhouf ever colliding wifh fhem, nof even 
during fraining. Inifially, when fhe model was uncerfain, 
fhe policy was conservative to stay away from obsfacles. 
The PILCO framework has been applied in fhe confexf 
of model-based imifafion learning fo learn confrollers 
fhat minimize fhe Kullback-Leibler divergence befween a 
disfribufion of demonsfrafed frajecfories and fhe predic- 
five disfribufion of robof frajecfories [20], [21]. Recenfly 
PILCO has also been exfended fo a mulfi-fask sef-up [13]. 


9 Conclusion 

We have infroduced PILCO, a pracfical model-based pol¬ 
icy search mefhod using analyfic gradienfs for policy 
learning. PiLCO advances sfafe-of-fhe-arf RL mefhods for 
confinuous sfafe and confrol spaces in ferms of learning 
speed by af leasf an order of magnifude. Key fo PiLCO's 
success is a principled way of reducing fhe effecf of 
model errors in model learning, long-ferm plarming, and 
policy learning. PiLCO is one of fhe few RL mefhods 
fhat has been directly applied to robotics without human 
demonstrations or other kinds of informafive initializa- 
fions or prior knowledge. 

The PILCO learning framework has demonsfrafed fhaf 
Bayesian inference and nonparamefric models for learn¬ 
ing confrollers is nof only possible buf also pracficable. 
Hence, nonparamefric Bayesian models can play a fun- 
damenfal role in classical confrol sef-ups, while avoiding 
fhe f 5 rpically excessive reliance on explicif models. 
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Appendix A 

Trigonometric Integration 


This section gives exact integral equations for trigono¬ 
metric functions, which are required to implement the 
discussed algorithms. The following expressions can be 
found in fhe book by [1], where x ^ is 

Gaussian disfribufed wifh mean p and variance cr^. 


E 3 ;[sin(a;)] = J sm{x)p{x) da; = exp(—sin(p), 

E 2 ,[cos(a;)] = J cos{x)p{x) dx = exp(—cos(p), 

E 3 :[sin(a:)^] = J sin(a;)^p(a:) da; 

= ^ (l — exp(—2 ct^) cos(2p)) , 

Ea;[cos(x)^] = J cos{x)'^p{x) dx 

= 5 (l -F exp(—2 ct^) cos(2p)) , 

E 3 :[sin(a;) cos(a;)] = J sin(a;) cos(ai)p(a;) da; 

— J k sin(2a;)p(a;) da; 

= I exp(—2tT^) sin(2p). 


Appendix B 
Gradients 


In fhe begirming of fhis secfion, we will give a few 
derivafive identifies fhat will become handy. Affer fhat 
we will defail derivafive compufafions in fhe contexf of 
fhe moment-mafching approximation. 


B.1 Identities 

Let us start with a set of basic derivafive idenfifies [2] 
fhaf will prove useful in fhe following: 


de 

d\K 
1)K 
dK-^{e) 

de 

de^KO 


= \K\tr K 


-idK\ 
de J 


-InT 


= \K\{K-^) 


= -K 


-i dKje) 

de 


K 


de 

dtT{AKB) 

dK 


= e^{KAK^ 


= 
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d\AK + I\ 


dK 


=-\AK + I\-\{AK + ly 


d 


OB,, 


-(a-5)' {A + B)-ya-b) 

I 

= -(a - 6)T [(A + + B)^A^ (a-b). 


In in the last identity B{:,i) denotes the ith column of 
B and B{i ,:) is the ith row of B. 


we obfain 


ggg. ^^2 ( d|I + A~^S«-l 
dSt-1 dtt-i 


1 

2 ^ ^ 

—r]{xi, St_i) 


+ |/ + A-ii:*_i 


1 

2 


d 

dSt-i 





B.2 Partial Derivatives of the Predictive Distribution 
with Respect to the Input Distribution 

For an input distribution Xt-i ~ N{xt-\ \ , 

where x = is the control-augmented state, 

we detail the derivatives of the predictive mean 
the predictive covariance Sa/ and the cross-covariance 
cov[it_i,A] (in the moment matching approximation) 
with respect to the mean fii_i and covariance T^t-i of 
the input distribution. 


B.2.1 Derivatives of the Predictive Mean with Respect to 
the Input Distribution 

In the following, we compute the derivative of the 
predictive GP mean € R® with respect to the 
mean and the covariance of the input distribution 
N{xt-i I The function value of the predic¬ 

tive mean is given as 

n 

i^l 

qa^=al\I + A-^%_^\-\ (51) 

X exp ( - \{x, - {Aa + 'Iit-i)~'^{xi - At-i)) ■ 

B.2.1.1 Derivative with respect to the Input Mean: 
Let us start with the derivative of the predictive mean 
with respect to the mean of the input distribution. From 
the function value in (51), we obtain the derivative 


dp\ 

dPt-i 



n 

^ ~ Pt—l) (^t—1 T Aa) 

i=l 


(52) 

(53) 


e ]I^ix(o+T’) fgj. target dimension, where we used 


dqg^ 

dPt-1 


Qaii^i Pt — l) (^i—1 T A„) 


(54) 


B.2.1.2 Derivative with Respect to the Input Co- 
variance Matrix: For the derivative of the predictive 
mean with respect to the input covariance matrix S(_i, 
we obtain 


a '9ga. 


(55) 


By defining 

7j[x^, S(_i) 

= exp ( - i(*i - {Aa + St-i)"^(&* - Pt-i)) 


for i = l,...,n. Here, we compute the two partial 
derivatives 


d|7 + A-^S,_i|-2 

dSt-i 

= -^|J + A„-iSt_i| 




=-^-\i + A-^%_,\-yi + 


d^t-i 

K 

-1\T 


((j + a-is,_i)-'a; 


-1 


= --|/ -f A:is,_ir 2 ((/ + A-^tt-y-^A- 


and for p,q = 1,... ,D + F 


(56) 

(57) 

(58) 

(59) 




(pq) 

t-i 


(Aa -|- S(_i) ^ 


(60) 


- -|((Aa -6 St_i)(,_p)(Ao 

+ (A„ + S,_i)^;,)(A, + e r(^+Gx(o+f) ^ 


where we need to explicitly account for the symmetry 
of Ag -1- Tit- 1 - Then, we obtain 






,-i a-G^ 


1 IZ. \T d{Aa -\- St_l) ^ ^ 

2 Pt-l) (®i Pt-l) 

oTt-1 


lx{D+F) 


{D+F)x{D+F)x{D+F)x{D+F) 


{D+F)xl 


{D+F)x{D+F) 

(61) 


where we used a tensor contraction in the last expres¬ 
sion inside the bracket when multiplying the difference 
vectors onto the matrix derivative. 


B.2.2 Derivatives of the Predictive Covariance with Re¬ 
spect to the Input Distribution 

For target dimensions a,b = 1,... ,E, the entries of the 
predictive covariance matrix Sa G R^^^ are given as 

= I^IiQ - Qa(lb)f^b 

+ 6gt,{al - tiUKg + aljr^Q)) (62) 
where 6gb = 1 if a = 6 and 0 otherwise. 
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The entries of Q e are given by 

X exp ( - - Xj)^{Aa + Ab)~'^{xi - Xj)) 

X exp 

X ((A-i + , (63) 


Zij — A^(A(j -|- A^) Xi Afj(Aa -j- A^) x 


j > 


(64) 


where i,j = 1,..., n. 

B.2.2.1 Derivative with Respect to the Input Mean: 
For the derivative of the entries of the predictive co- 
variance matrix with respect to the predictive mean, we 
obtain 


Using the partial derivative 

d|(A-^+Ab)-iSt_i+f| 

dSt-i 

= |(A:i+Ab)-iS*_i+J| 

X (((A: 1 +A,- 1 )S,_i+ 7 )-'(A-i + A,- 1 ))^ ( 69 ) 

= |(A-i+Ab)-iS*_i+/| (70) 

-udtt.A 
dtt-i ] 


tr ((^a ^ +-f) (^o^+^6^) 


the partial derivative of Qij with respect to the covari¬ 
ance matrix St_i is given as 


da'i 


^-T I dQ dq^ -y __ dqj 




= /3o 


-{K^ + alj) 


Qb -Qa 




dQij 

dtt-i 


2 rl-l 


dp-t-1 


(65) 


= c 


-M{A~^+Ab)-^tt., + I\-2 


where the derivative of Qij with respect to the input 
mean is given as 


X |(A-i-f-A6)-iS*_i+/| 


62 


X tr ((A„ ^-I-Aj ^)S(_i-I-j) (A„^-|-Aj,^) 


dQij 

9p-t-i 


-Qijizij /it_i)^((A„i-h Af, 1) ^-^St-i) ^ -h |(A„)i-1-At,)-iSt_i-h/pS 


i de2 


( 66 ) 




B.2.2.2 Derivative with Respect to the Input Co- 
variance Matrix: The derivative of the entries of the pre¬ 
dictive covariance matrix with respect to the covariance 
matrix of the input distribution is 


= c|(A”i+Ap-iSt_i+Jr2 


4(((A: 


de2 


dSt-i 


/3l ( 

dQ 



dtt-i 

5St-i ' 

Qa 

+ ^ab 



dQ 

d%- 


dqJ 

dSt-i 


f3b 


dSt-i 


d^t-i J 

(71) 

(72) 

A,-pSt_i +/)"pAr + A,-p)^e2 

(73) 


(67) 


Since the partial derivatives dq^/d'St-i and dq^/d'Et-i 
are known from Eq. (56), it remains to compute 
dQ/d'St-i- The entries Qij, i,j = 1,... ,n are given in 
Eq. (63). By defining 

■= ^/a^A ®xp - Xj)^(Af^ + Af^)-\xi - Xj))) 

62 := exp 4 (% - /ij_P^((A-i -P A-p-i 

X ~ Mt-l)^ 

we obtain the desired derivative 


where the partial derivative of 62 with respect to the 

^ -I- Aj, ^)“^ -I- S(_i) 


entries is given as 


562 _ 1 ~ , 

~ Mt-lJ 

01 ^ 


-‘t-1 


X {Zij - fit_f)e2 ■ (74) 

The missing partial derivative in (74) is given by 

5((A-i+A,-p-i + S*_i)-' 


5S 




— ^{pi) ’ (^5) 


dQij 

dSt-i 


-i|(A“i+Ap-iSt_i+J| 


5|(A:^+Afe)-iS,_i+J| 

5St-i 

+ |(A-i+A,-pSt_i+Jri 


62 


i 562 
5Si-i 


• ( 68 ) 


where we define 

p, g = 1,..., D -I- F with 

^(p,) = f((A-' + A,-p-i + i:*_i)- 

X ((Aa ^-I-Aj, p ^-I- 


(76) 


’O.p) 


(77) 
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This finally yields 

dQ, 




= ce2|(A-^+Ab)-'St_i+/| 


^ +-f) (Aa^+Af,\ 


{Zij Mt-i) At-l) 


(78) 




^ +-f) (^a^+^b\ 


[Zij P't-l) P-t-l) 


(79) 


which concludes the computations for the partial deriva¬ 
tive in (67). 


B.2.3 Derivative of the Cross-Covariance with Respect 
to the Input Distribution 
For the cross-covariance 

n 

A“] = PaiQaiiXi - Pt_i) , 

R '■= -i- A(j, 


we obtain 

dcov/,at_i[At,:rt_i] 

9Pt-i 


= i:t-iR ^ ((** - Mt-i) 


dqi 


+ q^I (80) 


€ for all target dimensions a = 1,..., 

The corresponding derivative with respect to the co- 
variance matrix St_i is given as 

dcov/,at_i[At,&t_i] 

dSt-i 


St-l Pa, {X, - Pt-l) 


dqa 


dSt-i 


(81) 
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